Classification with Linear Discriminant Analysis (LDA)

Our scenario

We have existing observations

\[(x_1, C_1), \ldots, (x_n, C_n)\]

where the \(C_i\) are categories.

Given a new observation \(x_{new}\), how do we predict \(C_{new}\)?


LDA: Come up with a “cutoff”: if \(x_{new} >\) the cutoff, predict class B; if not, predict class A.

A simple example

How about cutting in the middle?

dat %>%
  mutate(
    pred_class = if_else(
      val > 11, 
      "B",
      "A"
    )
  ) %>%
  count(Class, pred_class)
# A tibble: 4 × 3
  Class pred_class     n
  <chr> <chr>      <int>
1 A     A            839
2 A     B            161
3 B     A            166
4 B     B            834
  • What’s our sensitivity?
  • What’s our specificity?
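One way to answer: read the four confusion counts off the table above and apply the definitions. A minimal sketch, with the counts hard-coded from the table (treating class B as the “positive” class, so sensitivity is the fraction of true B’s we catch):

```r
# Confusion counts from the table above (cutoff = 11); B is "positive"
true_pos  <- 834  # B correctly predicted as B
false_neg <- 166  # B incorrectly predicted as A
true_neg  <- 839  # A correctly predicted as A
false_pos <- 161  # A incorrectly predicted as B

sensitivity <- true_pos / (true_pos + false_neg)  # 834 / 1000 = 0.834
specificity <- true_neg / (true_neg + false_pos)  # 839 / 1000 = 0.839
```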

What if we move the cutoff?

In what scenario would we choose an “uneven” cutoff?

How’d we do?

dat %>%
  mutate(
    pred_class = if_else(
      val > 10, 
      "B",
      "A"
    )
  ) %>%
  count(Class, pred_class)
# A tibble: 4 × 3
  Class pred_class     n
  <chr> <chr>      <int>
1 A     A            515
2 A     B            485
3 B     A             22
4 B     B            978
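The same arithmetic on these counts shows the trade-off a lower cutoff buys us: sensitivity for class B jumps, at the cost of specificity. A sketch with the counts hard-coded from the table (again treating B as “positive”):

```r
# Counts from the table above (cutoff = 10); B is "positive"
sensitivity <- 978 / (978 + 22)   # 0.978: we now catch almost every B
specificity <- 515 / (515 + 485)  # 0.515: nearly half the A's are misclassified
```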

LDA, in General

To perform classification with Linear Discriminant Analysis, we choose the best dividing line between the two classes.


The Big Questions

  • What is our definition of best?

  • What if we allow the line to “wiggle”?

Example

Let’s keep hanging out with the insurance dataset.

Suppose we want to use information about insurance charges to predict whether someone is a smoker or not.

ins <- read_csv("https://www.dropbox.com/s/bocjjyo1ehr5auz/insurance.csv?dl=1")

ins <- ins %>%
  mutate(
    smoker = factor(smoker)
  ) %>%
  drop_na(smoker)

Quick Quiz

What do we have to change?

The model?

The recipe?

The workflow?

The fit?

Step 1: Change the model!

lda_mod <- discrim_linear() %>%
  set_engine("MASS") %>%
  set_mode("classification")

Step 2: Fit our model!

lda_fit_1 <- lda_mod %>%
  fit(smoker ~ charges, 
      data = ins)
lda_fit_1 
parsnip model object

Call:
lda(smoker ~ charges, data = data)

Prior probabilities of groups:
    no    yes 
0.7981 0.2019 

Group means:
    charges
no     7528
yes   31152

Coefficients of linear discriminants:
            LD1
charges 0.00014

Visualizing the decision boundary

Step 3: Make some predictions!

preds <- predict(lda_fit_1, 
                 new_data = ins)

ins <- ins %>%
  mutate(
    pred_smoker = preds$.pred_class
  )

ins %>%
  accuracy(truth = smoker,
           estimate = pred_smoker)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.921
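Accuracy alone can flatter a model on imbalanced data (about 80% of this sample are non-smokers, so always guessing “no” would already score 0.80). A sketch of a fuller check with yardstick, assuming `ins` still holds the `pred_smoker` column created above:

```r
library(yardstick)

# Full confusion matrix
ins %>%
  conf_mat(truth = smoker, estimate = pred_smoker)

# Per-class metrics
ins %>%
  sensitivity(truth = smoker, estimate = pred_smoker)
ins %>%
  specificity(truth = smoker, estimate = pred_smoker)
```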

What if we want to use more than one predictor?

Adding age as a predictor

lda_fit_2 <- lda_mod %>%
  fit(smoker ~ charges + age, data = ins)

lda_fit_2
parsnip model object

Call:
lda(smoker ~ charges + age, data = data)

Prior probabilities of groups:
    no    yes 
0.7981 0.2019 

Group means:
    charges   age
no     7528 38.30
yes   31152 36.62

Coefficients of linear discriminants:
               LD1
charges  0.0001718
age     -0.0449953

Finding classification equation

lda_fit_2$fit$scaling
               LD1
charges  0.0001718
age     -0.0449953


\[\text{Score} = 0.0001718 \times \text{charges} - 0.0449953 \times \text{age}\]

Predict “smoker” if Score > 0

Finding our boundary

\[\text{Score} = 0.0001718 \times \text{charges} - 0.0449953 \times \text{age}\]

\[0 = 0.0001718 \times \text{charges} - 0.0449953 \times \text{age}\]

\[\text{age} = \frac{0.0001718}{0.0449953} \times \text{charges}\]

\[\text{age} = 0.00382 \times \text{charges}\]
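As a quick numeric check of the boundary slope, we can divide the two LD1 coefficients directly; a sketch with the values from `lda_fit_2$fit$scaling` hard-coded:

```r
# LD1 coefficients printed above
coef_charges <- 0.0001718
coef_age     <- -0.0449953

# Set the score to zero and solve for age as a function of charges
slope <- coef_charges / (-coef_age)
round(slope, 5)  # approximately 0.00382
```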

Let’s Plot it!

coefficients <- lda_fit_2$fit$scaling

my_slope <- coefficients[1] / (-1 * coefficients[2])


ins %>%
  ggplot(mapping = aes(x = charges, y = age, color = smoker)) +
  geom_point() +
  geom_abline(mapping = aes(slope = my_slope, intercept = 0), 
              lwd = 1.5) +
  scale_x_continuous(labels = label_dollar()) +
  labs(x = "Medical Charges", 
       y = "Age", 
       color = "Smoking Status")

Let’s Plot it!

Try it! (you know the drill…)

Open Activity-Classification-LDA.qmd

Select the best LDA model for predicting smoker status.

Compare the accuracy to your KNN and Logistic Regression models (from last class).

Quadratic Discriminant Analysis

One more time: wiggly style

QDA

What if we allow the separating line to be non-linear?

qda_mod <- discrim_regularized(frac_common_cov = 0) %>% 
  set_engine('klaR') %>% 
  set_mode('classification')

In this case, we allow the data in the different categories to have different variances.
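Fitting and evaluating the QDA model follows the same steps as LDA; a hedged sketch, assuming `ins` is prepared as above and the klaR and discrim packages are installed:

```r
library(discrim)

qda_fit <- qda_mod %>%
  fit(smoker ~ charges + age, data = ins)

qda_preds <- predict(qda_fit, new_data = ins)

ins %>%
  mutate(pred_smoker_qda = qda_preds$.pred_class) %>%
  accuracy(truth = smoker, estimate = pred_smoker_qda)
```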

An example scenario

Let’s visualize the relationship

Visualizing the decision boundary

Try it!

Open Activity-Classification-LDA.qmd again

Select the best QDA model

Compare to prior models